
Remove HF_TOKEN dependency in E2E test #357


Merged: jack8558 merged 21 commits into main from jackoh/remove-hf-token-in-e2e-test on Aug 15, 2025

Conversation

jack8558
Collaborator

@jack8558 jack8558 commented Aug 5, 2025

Removing HF_TOKEN dependency in E2E test

  • Created tp save_hf_model_files_to_gcs to save Hugging Face model files in GCS
  • Saved the tokenizers and Llama-3-8B's model weights and configs in a GCS bucket (the weights and configs are needed for the SFT E2E test)
  • Since the Hugging Face libraries can't load directly from GCS, added a util function copy_gcs_to_local that downloads the files to a temporary directory
  • Removed HF_TOKEN from the E2E test and the CPU test

tp save_hf_model_files_to_gcs example:

tp save-hf-model-files-to-gcs \
  --repo-id "meta-llama/Meta-Llama-3-8B" \
  --gcs-path "gs://bucket" \
  --file-type "all" \
  --temp-dir /mnt/disks/tmp
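
For context, a minimal sketch of what such a command could do internally (a hypothetical helper, not the code merged in this PR; assumes huggingface_hub and gsutil are available): download the files from the Hub once with a token, then push them to the bucket.

import os
import subprocess

from huggingface_hub import snapshot_download


def save_hf_model_files_to_gcs(repo_id: str, gcs_path: str,
                               file_type: str = "all",
                               temp_dir: str = "/tmp") -> None:
    """Hypothetical sketch: mirror Hugging Face model files into a GCS bucket.

    Runs once with a valid HF token so that later test runs can read the
    files from GCS without needing HF_TOKEN.
    """
    # Restrict which files are pulled from the Hub; "all" mirrors everything.
    # The pattern mapping here is illustrative only.
    patterns = None if file_type == "all" else ["tokenizer*", "*.json"]
    local_dir = snapshot_download(
        repo_id=repo_id,
        allow_patterns=patterns,
        local_dir=os.path.join(temp_dir, repo_id.split("/")[-1]),
        token=os.environ.get("HF_TOKEN"),  # only needed for this one-time export
    )
    # Copy the downloaded snapshot into the bucket with gsutil.
    subprocess.run(["gsutil", "-m", "cp", "-r", local_dir, gcs_path], check=True)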

#14

@jack8558 jack8558 changed the title from DRAFT to Remove HF_TOKEN dependency in E2E test Aug 5, 2025
@jack8558 jack8558 linked an issue Aug 8, 2025 that may be closed by this pull request
@jack8558 jack8558 marked this pull request as ready for review August 8, 2025 15:31
@jialei777
Collaborator

Thank you for putting this together. My question: since the models and tokenizers are gated (and under certain terms set by Meta), are they allowed to be saved in a GCP bucket and distributed by us?

@vlasenkoalexey
Collaborator

> Thank you for putting this together. My question: since the models and tokenizers are gated (and under certain terms set by Meta), are they allowed to be saved in a GCP bucket and distributed by us?

Putting weights up for our own use in E2E tests is fine; distributing weights and tokenizers publicly is not.

@@ -153,8 +155,13 @@ def _maybe_save_checkpoint(self, config: DictConfig) -> None:
    # Step 3: Save the HF config files and tokenizer
    if xr.process_index() == 0:
        logger.info("Saving Hugging Face configs and tokenizer to %s", save_dir)
        model_utils.copy_hf_config_files(config.model.pretrained_model, save_dir)
        model_utils.save_hf_tokenizer(config.model.pretrained_model, save_dir)
        # Copy to local if in GCS
Collaborator

Could you explain why this is necessary?
If training started from a gcsfuse mount, which looks like a local folder, would it still try to copy?

Collaborator Author

This was needed because the GCS bucket we load the tokenizer from is not mounted by gcsfuse.

The bucket we mount in thunk.py is artifact_dir. The implementation in this PR copies GCS content to local storage using gsutil instead of relying on gcsfuse.
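
A minimal sketch of what a gsutil-based copy_gcs_to_local could look like under these assumptions (gsutil available on the host; the exact signature in the PR may differ). It also shows why a gcsfuse-mounted path, which has no gs:// prefix, would not trigger a copy:

import os
import subprocess
import tempfile


def copy_gcs_to_local(path: str) -> str:
    """Illustrative sketch: download a gs:// path into a local temp directory.

    Paths without a gs:// prefix (e.g. a gcsfuse mount, which already looks
    like a local folder) are returned unchanged, so nothing is copied.
    """
    if not path.startswith("gs://"):
        return path
    local_dir = tempfile.mkdtemp()
    # Recursively copy the bucket contents so HF libraries can read from disk.
    subprocess.run(["gsutil", "-m", "cp", "-r", path, local_dir], check=True)
    # gsutil cp -r creates a subdirectory named after the last path component.
    return os.path.join(local_dir, path.rstrip("/").split("/")[-1])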

)

local_dir = tempfile.mkdtemp()
_TEMP_DIRS_TO_CLEAN.append(local_dir)
Collaborator

This is a bad pattern; could you make the temp dir an argument, or use a context manager to auto-clean it?
If that's inconvenient, feel free to leave it as is.

Collaborator Author

Updated the function with a context manager. Let me know if this looks better.
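
For reference, one way the context-manager form could look (a sketch under the same gsutil assumption, not necessarily the exact code merged here); the temporary copy is removed automatically when the with block exits:

import contextlib
import os
import subprocess
import tempfile


@contextlib.contextmanager
def copy_gcs_to_local(path: str):
    """Illustrative sketch: yield a local copy of a gs:// path, clean up on exit."""
    if not path.startswith("gs://"):
        # Already local (e.g. a gcsfuse mount): nothing to copy or clean up.
        yield path
        return
    with tempfile.TemporaryDirectory() as tmp_dir:
        subprocess.run(["gsutil", "-m", "cp", "-r", path, tmp_dir], check=True)
        # gsutil cp -r creates a subdirectory named after the last path component.
        yield os.path.join(tmp_dir, path.rstrip("/").split("/")[-1])
    # TemporaryDirectory removes everything on exit, so no manual cleanup list is needed.

Usage would then be, for example, with copy_gcs_to_local("gs://bucket/tokenizers") as local_path: and the Hugging Face loaders read from local_path.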

@jack8558 jack8558 merged commit 3a1b818 into main Aug 15, 2025
27 of 29 checks passed
@jack8558 jack8558 deleted the jackoh/remove-hf-token-in-e2e-test branch August 15, 2025 03:29
Development

Successfully merging this pull request may close these issues.

Remove dependency on HuggingFace token
3 participants